Note: Some codes adapted from the lecture silds written by Ruth Isserlin. To compile this markdown file, please include the data and figures folder and place them at the same directory with this file.
For this assignment, I used Transcriptional profile of human STAT1-/- fibroblasts expressing LY6E or empty control vector data GSE111958 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111958) which about if LY6E will enhance the infectivity of some viruses(Mar KB 2018). In assignment 1, I had already cleaned and normalized the data, in this assignment, I will do Differential Gene Expression analysis and Thresholded over-representation analysis.
Boxplot and density plot from A1
First of all, use the following command to launch Rstudio in Docker.
docker run -e PASSWORD=1234 --rm -p 8787:8787 -v /Users/bfx/Documents/BCB420:/home/rstudio/docker_bcb420 risserlin/bcb420-base-image
Use the normalized data from assignment 1.
normalized_count_data <- read.table(file=file.path("data", "GSE111958_finalized_normalized _counts.txt"),
header = TRUE,sep = "\t",
stringsAsFactors = FALSE,
check.names=FALSE)
kable(normalized_count_data[1:5, ], type="html")
| GENE_NAME | SEU227 | SEU235 | SEU239 | SLU227 | SLU235 | SLU239 |
|---|---|---|---|---|---|---|
| NOC2L | 36.480486 | 28.428988 | 38.926023 | 28.968916 | 25.511290 | 26.148589 |
| AGRN | 13.259045 | 9.391719 | 15.111525 | 9.837038 | 12.755645 | 9.444837 |
| C1orf159 | 6.375934 | 6.091926 | 7.001937 | 4.841062 | 4.299656 | 8.248766 |
| SDF4 | 9.201632 | 5.801834 | 11.353423 | 6.312745 | 5.876196 | 7.671353 |
| B3GALT6 | 13.440179 | 13.235435 | 13.450049 | 12.548033 | 11.716562 | 10.929615 |
# Create the numerical matrix for heatmap
heatmap_matrix <- normalized_count_data[ , 2:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GENE_NAME
colnames(heatmap_matrix) <- colnames(normalized_count_data[, 2:ncol(normalized_count_data)])
# Create the heatmap
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)), c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix),
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE
)
without_row_normalization_heatmap<- current_heatmap
Draw the heatmap directly, we can see that the range of the data is too large and the most value are distributed at low counts, so the whole heatmap is blue and only the top is red.
# Row-normalization
heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)), c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix),
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE
)
row_normalization_heatmap <- current_heatmap
Scale each row and centre the around the mean, so we can narrow down the range of the data to make the heatmap more easy to read.
empty_samples <- grep(colnames(normalized_count_data),
pattern="^SEU")
LY6E_samples <- grep(colnames(normalized_count_data),
pattern="^SLU")
gene_of_interest <- which(normalized_count_data$GENE_NAME == "LY6E")
# LY6E expression in empty samples
LY6E_empty_samples <- t(normalized_count_data
[gene_of_interest,
empty_samples])
colnames(LY6E_empty_samples) <- c("empty_samples")
LY6E_empty_samples
## empty_samples
## SEU227 8.404640
## SEU235 5.801834
## SEU239 7.397527
# LY6E expression in LY6E samples
LY6E_L_samples <- t(normalized_count_data
[gene_of_interest,
LY6E_samples])
colnames(LY6E_L_samples) <- c("LY6E_samples")
LY6E_L_samples
## LY6E_samples
## SLU227 2465.263
## SLU235 1964.477
## SLU239 1869.830
# Using a simple t.test compare STAT1 gene.
# The null hypothesis of the two sample t-test is that there is no difference in means of each sample
# It assumes that both stat1_LY6E_samples and stat1_empty_samples are normally distributed.
t.test(x=t(LY6E_L_samples),y=t(LY6E_empty_samples))
##
## Welch Two Sample t-test
##
## data: t(LY6E_L_samples) and t(LY6E_empty_samples)
## t = 11.328, df = 2.0001, p-value = 0.007702
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1297.825 2887.485
## sample estimates:
## mean of x mean of y
## 2099.856560 7.201334
Now we can revisit the MDSPlot with limma package
rep_colors <- rainbow(3)
rep_colors<- rep(rep_colors, 2)
limma::plotMDS(heatmap_matrix, col=rep_colors)
Define the groups, The first 3 character represent the conditions (Empty control or LY6E)
samples <- data.frame(lapply(colnames(normalized_count_data)[2:7],
FUN=function(x){unlist(strsplit(x, perl=TRUE, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])"))}))
colnames(samples) <- colnames(normalized_count_data)[2:7]
rownames(samples) <- c("cell_type", "rep")
samples <- data.frame(t(samples))
samples
model_design <- model.matrix(~ samples$cell_type)
kable(model_design, type="html")
| (Intercept) | samples$cell_typeSLU |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
# Create data matrix
expressionMatrix <- as.matrix(normalized_count_data[,2:7])
rownames(expressionMatrix) <- normalized_count_data$GENE_NAME
colnames(expressionMatrix) <- colnames(normalized_count_data)[2:7]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
# Fit the data to the model
fit <- lmFit(minimalSet, model_design)
# Apply empircal Bayes to compute differential expression for the above described model.
fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
# Merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
# Rename the colnames
colnames(output_hits)<-c("hgnc_symbol", colnames(output_hits)[2:ncol(output_hits)])
# sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
kable(output_hits[1:10,],type="html")
| hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|
| 5213 | LY6E | 2092.65523 | 1053.52895 | 14.501502 | 0.0000007 | 0.0084803 | 0.1230086 |
| 2949 | EIF5A | -97.89656 | 377.62546 | -9.667372 | 0.0000144 | 0.0794557 | -0.3053602 |
| 8132 | RPS10 | -75.05683 | 314.78603 | -9.179320 | 0.0000208 | 0.0794557 | -0.3808356 |
| 9378 | SZRD1 | -36.91134 | 189.28848 | -8.549779 | 0.0000344 | 0.0950860 | -0.4930467 |
| 968 | BGN | -39.04973 | 68.22038 | -8.323150 | 0.0000416 | 0.0950860 | -0.5381532 |
| 7828 | RCC2 | -39.92165 | 219.98396 | -7.769751 | 0.0000671 | 0.1163271 | -0.6604712 |
| 6096 | NDST1 | -40.52974 | 170.06888 | -7.702479 | 0.0000712 | 0.1163271 | -0.6766295 |
| 557 | ARHGAP1 | -19.49320 | 54.61903 | -7.099849 | 0.0001243 | 0.1775787 | -0.8356092 |
| 2067 | COL1A1 | -309.90818 | 644.96489 | -6.714755 | 0.0001808 | 0.2090339 | -0.9522483 |
| 3610 | FST | 23.54229 | 85.83211 | 6.703419 | 0.0001828 | 0.2090339 | -0.9558796 |
the number of genes pass the threshold p-value < 0.05
length(which(output_hits$P.Value < 0.05))
## [1] 1476
the number of genes pass correction
length(which(output_hits$adj.P.Val < 0.05))
## [1] 1
Account for the rep variability to improve the results
model_design_rep <- model.matrix(
~ samples$rep + samples$cell_type)
kable(model_design_rep,type="html")
| (Intercept) | samples$rep235 | samples$rep239 | samples$cell_typeSLU |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 |
Fit the data to the new model
fit_rep <- lmFit(minimalSet, model_design_rep)
Apply empircal Bayes to compute differential expression for the new model
# Since my data is RNA-seq data, we set the parameter trend=TRUE
fit2_rep <- eBayes(fit_rep,trend=TRUE)
topfit_rep <- topTable(fit2_rep,
coef=ncol(model_design_rep),
adjust.method = "BH",
number = nrow(expressionMatrix))
# Merge hgnc names to topfit table
output_hits_rep <- merge(normalized_count_data[,1],
topfit_rep,
by.y=0,by.x=1,
all.y=TRUE)
# Rename the colnames
colnames(output_hits_rep)<-c("hgnc_symbol", colnames(output_hits_rep)[2:ncol(output_hits)])
# Sort by pvalue
output_hits_rep <- output_hits_rep[order(output_hits_rep$P.Value),]
kable(output_hits_rep[1:10,],type="html")
| hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|
| 5213 | LY6E | 2092.65523 | 1053.52895 | 17.020902 | 0.0000020 | 0.0229303 | -0.2569220 |
| 2067 | COL1A1 | -309.90818 | 644.96489 | -15.028586 | 0.0000043 | 0.0243468 | -0.3220116 |
| 968 | BGN | -39.04973 | 68.22038 | -9.480230 | 0.0000659 | 0.1521454 | -0.7220288 |
| 8132 | RPS10 | -75.05683 | 314.78603 | -9.306232 | 0.0000734 | 0.1521454 | -0.7447125 |
| 2949 | EIF5A | -97.89656 | 377.62546 | -9.270816 | 0.0000751 | 0.1521454 | -0.7494546 |
| 2076 | COL5A1 | -156.39722 | 368.23077 | -8.816718 | 0.0001004 | 0.1521454 | -0.8142812 |
| 4365 | HYOU1 | -59.25213 | 284.60310 | -8.587464 | 0.0001169 | 0.1521454 | -0.8500763 |
| 3503 | FILIP1L | 36.78239 | 151.52313 | 8.579183 | 0.0001175 | 0.1521454 | -0.8514109 |
| 6096 | NDST1 | -40.52974 | 170.06888 | -8.551337 | 0.0001198 | 0.1521454 | -0.8559201 |
| 9378 | SZRD1 | -36.91134 | 189.28848 | -8.006140 | 0.0001747 | 0.1602140 | -0.9514484 |
the number of genes pass the threshold p-value < 0.05
length(which(output_hits_rep$P.Value < 0.05))
## [1] 1865
the number of genes pass correction
length(which(output_hits_rep$adj.P.Val < 0.05))
## [1] 2
Draw MA plot, The point looks too few, but I checked dim(topfit) which is 11433, 6. Seems like the range of the data is too large.
differential_expression_results <- topfit[rownames(minimalSet), ]
differential_expression_status <- apply(differential_expression_results, 1, function(gene){
if (gene["P.Value"] < 0.05) {
if (gene["logFC"] < 0 ){
return("Downregulated genes")
}
else{
return("Upregulated genes")
}
} else {
return("Not differentially expressed")
}
})
limma::plotMD(fit2, column = ncol(fit2), cex = 0.5, status = differential_expression_status,
main = "Differential gene expression")
Compare the results from two different models
simple_model_pvalues <- data.frame(hgnc_symbol = output_hits$hgnc_symbol,
simple_pvalue=output_hits$P.Value)
rep_model_pvalues <- data.frame(hgnc_symbol = output_hits_rep$hgnc_symbol,
rep_pvalue = output_hits_rep$P.Value)
two_models_pvalues <- merge(simple_model_pvalues,
rep_model_pvalues,by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$simple_pvalue<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$rep_pvalue<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$simple_pvalue<0.05 & two_models_pvalues$rep_pvalue<0.05] <- "red"
# plot
plot(two_models_pvalues$simple_pvalue,two_models_pvalues$rep_pvalue,
col = two_models_pvalues$colour,
xlab = "simple model p-values",
ylab ="Rep model p-values",
main="Simple vs Rep Limma")
For the interest gene
hgnc_symbol_of_interest <- "LY6E"
# There are too many overlap around the gene of interest, so I set others genes be transparent
two_models_pvalues$colour <- alpha("grey", 0.2)
two_models_pvalues$colour[two_models_pvalues$hgnc_symbol==hgnc_symbol_of_interest] <- "red"
plot(two_models_pvalues$simple_pvalue,two_models_pvalues$rep_pvalue,
col = two_models_pvalues$colour,
xlab = "Simple model p-values",
ylab ="Rep model p-values",
main="Simple vs Rep Limma (for interest gene)")
# Since there are too many overlap arround the gene of interest, use a point to mark the location of it.
points(two_models_pvalues[two_models_pvalues$hgnc_symbol==hgnc_symbol_of_interest, 2:3], pch=24, col="red", cex=1.5)
Draw the Heatmap with the top hits using Limma (accounting for rep variability) which only show the p-value < 0.05 and columns ordered by cell type.
top_hits <- output_hits_rep$hgnc_symbol[output_hits_rep$P.Value<0.01]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits), pattern = "^SEU"),
grep(colnames(heatmap_matrix_tophits), pattern = "^SLU"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE,show_column_dend = FALSE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)
current_heatmap
Obviously, the heatmap show that the gene expression has significant different for different cell type.
filtered_data_matrix <- as.matrix(normalized_count_data[,2:7])
rownames(filtered_data_matrix) <- normalized_count_data$GENE_NAME
# Set up edgeR objects
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
# Estimate Dispersion
d <- estimateDisp(d, model_design_rep)
# Fit the model
fit <- glmQLFit(d, model_design_rep)
# Calculate differential expression using the Quasi liklihood model
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$cell_typeSLU')
kable(topTags(qlf.pos_vs_neg), type="html")
|
|
|
|
# get all results
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(normalized_count_data))
the number of genes pass the threshold p-value < 0.05
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 498
the number of genes pass correction
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 4
qlf_rep_model_pvalues <- data.frame(
hgnc_symbol = rownames(qlf_output_hits$table),
qlf_rep_pvalue=qlf_output_hits$table$PValue)
limma_rep_model_pvalues <- data.frame(
hgnc_symbol = output_hits_rep$hgnc_symbol,
limma_rep_pvalue = output_hits_rep$P.Value)
two_models_pvalues <- merge(qlf_rep_model_pvalues,
limma_rep_model_pvalues,
by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$qlf_rep_pvalue<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$limma_rep_pvalue<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$qlf_rep_pvalue<0.05 & two_models_pvalues$limma_rep_pvalue<0.05] <- "red"
# plot
plot(two_models_pvalues$qlf_rep_pvalue,
two_models_pvalues$limma_rep_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF rep model p-values",
ylab ="Limma rep model p-values",
main="QLF vs Limma ")
hgnc_symbol_of_interest <- "LY6E"
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$hgnc_symbol==hgnc_symbol_of_interest] <- "red"
plot(two_models_pvalues$qlf_rep_pvalue,
two_models_pvalues$limma_rep_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF patient model p-values",
ylab ="Limma Patient model p-values",
main="QLF vs Limma (for interest gene)")
# Since there are too many overlaps arround the gene of interest, use a point to mark the location of it.
points(two_models_pvalues[
two_models_pvalues$hgnc_symbol==hgnc_symbol_of_interest,2:3],
pch=24, col="red", cex=1.5)
Draw the Heatmap with top hit using QLF model
top_hits <- rownames(qlf_output_hits$table)[output_hits_rep$P.Value<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "^SEU"),
grep(colnames(heatmap_matrix_tophits),pattern = "^SLU"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
Also obviously, the heatmap show that the gene expression has significant different for different cell type.
# Merge gene names with the top hits
qlf_output_hits_withgn <- merge(normalized_count_data[,1],
qlf_output_hits,
by.x=1, by.y = 0)
colnames(qlf_output_hits_withgn)<-c("hgnc_symbol", colnames(qlf_output_hits_withgn)[2:ncol(qlf_output_hits_withgn)])
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
all_differentially_expressed_genes <- qlf_output_hits_withgn$hgnc_symbol[
which(qlf_output_hits_withgn$PValue < 0.05)]
We can see there are 498 differentially expressed genes, the most differential expression are downregulated which is 351, and upregulated genes are only 147.
Use g:profiler for thresholded gene set enrichment analysis. Set organism to “Homo sapiens”, significance threshold to “Benjamini-Hochberg FDR” and threshold to “0.05”.
Data sources:
GO biological process: releases/2019-07-01
Reactome annotations: ensembl classes: 2019-10-2
WikiPathways: 20190910
Exprot the genelist for upregulated, downregulated and all differentially expressed genes. Then use these genelist as query separately.
# Export the genelist to txt file for g:profiler
write.table(x=upregulated_genes,
file=file.path("data","LY6E_upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","LY6E_downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=all_differentially_expressed_genes,
file = file.path("data", "all_differentially_expressed_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
The results of differentially expressed genes
The number of genesets for each data source:
GO:BP: 954
REAC: 239
WP: 16
More details for each data source:
the top hit is cellular component organization, the number of genes found in cellular component organization is 276
the top hit is Axon guidance, the number of gene from query found in Axon guidance is 64
the top hit is Focal Adhesion, the number of gene from query found in Focal Adhesion is 32
For the g:profiler results, the result of downregulated is very similar with the result of all differentially expressed genes. For the whole list, cellular component organization is the top hit and viral process is the third hit in GO:BP which are both related to the infectivity of viruses. The result of upregulated shows that some genes related to cell cycle, Mitochondria and ATP synthesis are upregulated.
This support the conclusions discussed in the original paper that LY6E promotes early step of the virus life cycle and enhances infectivity of some viruses.
Also I found an article “RLR-mediated antiviral innate immunity requires oxidative phosphorylation activity”, published at 2017, said mitochondria is a platform for antiviral innate immunity and antiviral innate immunity requires oxidative phosphorylation (OXPHOS) activity.(Yoshizumi 2017) oxidative phosphorylation (OXPHOS) is the top hit for upregulated genelist in GO:BP and lots of genes related to mitochondria differentially expressed.
Therefore LY6E must play some important role in early step of the virus life cycle and infectivity of some viruses.
Mar KB, Boys IN, Rinkenberger NR. 2018. “LY6E Mediates an Evolutionarily Conserved Enhancement of Virus Infection by Targeting a Late Entry Step.” Nat Commun 9 (11): 3603.
Yoshizumi, Imamura, T. 2017. “RLR-Mediated Antiviral Innate Immunity Requires Oxidative Phosphorylation Activity.” Sci Rep 7: 5379.